# Prosper Loan Dataset Analysis by Gisela Chen

Goal: explore factors that affect BorrowerAPR.

This data set consists of 113937 loan listings with 81 variables describing each loan. To limit the initial EDA to only 10-15 variables, I will focus on analyzing the characteristics of the borrowers and find out what are the factors that might affect the APR of the loan.

Univariate Plots Section

First, I will look into the dataframe structure.

## 'data.frame':    113937 obs. of  15 variables:
##  $ ListingNumber            : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate      : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState            : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...

To better reflect the nature of the data, I will change the data type of some columns from int to factor (ListingNumber and Term), convert ListingCreationDate from factor to date object, then change the ListingCategory from numeric to factor to make the interpretation more intuitive.

Now let’s look at the summary of the working dataset!

##  ListingNumber    ListingCreationDate  Term      
##  951186 :     6   Min.   :2005-11-09   12: 1614  
##  882888 :     4   1st Qu.:2008-09-19   36:87778  
##  892845 :     4   Median :2012-06-16   60:24545  
##  1056749:     4   Mean   :2011-07-08             
##  1057901:     4   3rd Qu.:2013-09-09             
##  875616 :     3   Max.   :2014-03-10             
##  (Other):113912                                  
##                  LoanStatus     BorrowerAPR               ListingCategory 
##  Current              :56576   Min.   :0.00653   DebtConsolidation:58308  
##  Completed            :38074   1st Qu.:0.15629   NotAvailable     :16965  
##  Chargedoff           :11992   Median :0.20976   Other            :10494  
##  Defaulted            : 5018   Mean   :0.21883   HomeImprovement  : 7433  
##  Past Due (1-15 days) :  806   3rd Qu.:0.28381   Business         : 7189  
##  Past Due (31-60 days):  363   Max.   :0.51229   Auto             : 2572  
##  (Other)              : 1108   NA's   :25        (Other)          :10976  
##  BorrowerState                      Occupation         EmploymentStatus
##  CA     :14717   Other                   :28617   Employed     :67322  
##  TX     : 6842   Professional            :13628   Full-time    :26355  
##  NY     : 6729   Computer Programmer     : 4478   Self-employed: 6134  
##  FL     : 6720   Executive               : 4311   Not available: 5347  
##  IL     : 5921   Teacher                 : 3759   Other        : 3806  
##         : 5515   Administrative Assistant: 3688                : 2255  
##  (Other):67493   (Other)                 :55456   (Other)      : 2718  
##  EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
##  Min.   :  0.00           Min.   :  0.0         Min.   : 19.0        
##  1st Qu.: 26.00           1st Qu.:660.0         1st Qu.:679.0        
##  Median : 67.00           Median :680.0         Median :699.0        
##  Mean   : 96.07           Mean   :685.6         Mean   :704.6        
##  3rd Qu.:137.00           3rd Qu.:720.0         3rd Qu.:739.0        
##  Max.   :755.00           Max.   :880.0         Max.   :899.0        
##  NA's   :7625             NA's   :591           NA's   :591          
##  DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
##  Min.   : 0.000    Min.   :      0     Min.   : 1000     
##  1st Qu.: 0.140    1st Qu.:   3200     1st Qu.: 4000     
##  Median : 0.220    Median :   4667     Median : 6500     
##  Mean   : 0.276    Mean   :   5608     Mean   : 8337     
##  3rd Qu.: 0.320    3rd Qu.:   6825     3rd Qu.:12000     
##  Max.   :10.010    Max.   :1750003     Max.   :35000     
##  NA's   :8554

It appeared odd to me that there were duplications in the listing number. The listing number should be unique since it represents each individual loan listings so I will take a closer look at some of them.

##       ListingNumber ListingCreationDate Term LoanStatus BorrowerAPR
## 13079        951186          2013-10-02   60    Current     0.16662
## 14889        951186          2013-10-02   60    Current     0.16662
## 20570        951186          2013-10-02   60    Current     0.16662
## 31451        951186          2013-10-02   60    Current     0.16662
## 42751        951186          2013-10-02   60    Current     0.16662
## 42752        951186          2013-10-02   60    Current     0.16662
##         ListingCategory BorrowerState Occupation EmploymentStatus
## 13079 DebtConsolidation            MD      Other         Employed
## 14889 DebtConsolidation            MD      Other         Employed
## 20570 DebtConsolidation            MD      Other         Employed
## 31451 DebtConsolidation            MD      Other         Employed
## 42751 DebtConsolidation            MD      Other         Employed
## 42752 DebtConsolidation            MD      Other         Employed
##       EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
## 13079                       26                   720                   739
## 14889                       26                   720                   739
## 20570                       26                   720                   739
## 31451                       26                   720                   739
## 42751                       26                   720                   739
## 42752                       26                   720                   739
##       DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
## 13079              0.41                3000              10000
## 14889              0.41                3000              10000
## 20570              0.41                3000              10000
## 31451              0.41                3000              10000
## 42751              0.41                3000              10000
## 42752              0.41                3000              10000

After checking the top three listings (shown above is the first example), it seems that they are indeed duplications of the same listing so I will keep only the unique loan listings which add up to 113066 observations.

Now let’s look at individual variables.

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2005-11-09" "2008-09-13" "2012-06-08" "2011-07-02" "2013-08-31" 
##         Max. 
## "2014-03-10"

The listings range from November 2005 to March 2014. There is a gap at year 2009 in which there were no listings at all, splitting the data into two parts. There is a sharp increase in the number of listings in 2013.

##    12    36    60 
##  1614 87224 24228

The majority of the loans is of 36-month term (87224) followed by 60-month term (24228) and 12-month term (1614), indicating that the loan dataset is comprised of short term loans.

The majority of the loans are either current or completed with remaining loans that are mainly charged off, defaulted or past due (Top plot). I decided to create a simplified column with only three levels: “Current”, “Completed” and “Bad_loan” (Bottom plot).

The most common reason for taking out a loan in the dataset is debt consolidation. There is also a fraction of loans that does not provide a concrete reason (Not available or other). Home improvement and business are also main listing categories in this dataset.

It looks like California has the most of the loan listings followed by Texas, Florida and New York.

This is a bit of a busy plot at the x-axis but it nicely demonstrates all flavors of occupation the borrowers had during loan application. Although the highest level is “other” and does not provide too much information, we can see that the top three occupations are professionals, computers programmers and executives.

I am quite confused with the classification of this category. More than half of the borrowers are employed, although from the choices (levels) of the employment status it is not clear what is the difference between “employed” vs. “full-time” or “Part-time’ so the proportion might be different depending on the interpretation. Overall most borrowers are employed.

I suspect that the choices in this category was changed at some point so I decide to look at the time distribution of the EmploymentStatus categories (see the chart below).

Unfortunately, the result is still confusing to me. For example, “Employed” and “Full-time” are two overlapping classifications to me and I was hoping to find that the distribution of the counts for these two classifications would be distinct (ie. one is collect at a period that is not overlapped to the other). However, it seemed that in the dataset both of them are used to classify loans that are collected at the same time and I have no further distinction on how the choice was made.

Next I will look at distribution of the numeric columns starting with APR.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21900 0.28390 0.51230      25

The APR histogram appears bimodal and has a broad, normally distributed first peak at 0.2%, near the median(red) and mean(green), and a sharp second peak at 0.36%.

There are 25 listings that are from the very beginning of the dataset that don’t have APR information. I will remove those listings since my focus is on characterizing the APR.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.06  137.00  755.00    7600

Employment status duration for the borrowers range from 0 month to 755 months. The original distribution was positively skewed with a long tail to the right (Top plot). Transformation by taking log10 seems to result in a slightly negatively skewed distribution (Middle plot). Transformation of the data by square rooting seems to reduce the longtailness (Bottom plot).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3198    4667    5604    6812 1750000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.505   3.669   3.615   3.833   6.243
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   56.55   68.31   70.66   82.54 1323.00

StatedMonthlyIncome also shows positive skewness (first summary) but can be tranformed to normal distribution by taking log10 (second summary) or square root (third summary).

I am puzzled by the large number of maximum StatedMonthlyIncome so decided to take a closer look.

##       ListingNumber ListingCreationDate Term LoanStatus BorrowerAPR
## 53015        694401          2013-01-04   12  Completed     0.25785
## 53168        560783          2012-02-20   36    Current     0.35797
##       ListingCategory BorrowerState   Occupation EmploymentStatus
## 53015        Business            CA        Other    Self-employed
## 53168        Business            OR Professional    Self-employed
##       EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
## 53015                      117                   720                   739
## 53168                      114                   740                   759
##       DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
## 53015                NA            618547.8               4000
## 53168                NA           1750002.9               4000

It appears that for a StatedMonthlyIncome greater than 500000 there are two loan lisitngs for businesses. The StatedMonthlyIncome might be large because it is representing income for the whole business and not individuals. These are likely true outliers.

This plot shows both the upper credit score in red and the lower credit score in blue. The trends are very similar so I will only keep the upper range scores.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   699.0   704.5   739.0   899.0     566
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   439.0   679.0   699.0   705.3   739.0   899.0

There are a few outliers with very low credit scores that negatively skewed the data slightly. After removing those with lower values (less than 400), the variable seemed normally distributed. I would think that the lower credit score is consistent with the fact that many of these loans were targeted for debt consolidation. I would be curious to find out if there is a correlation between low credit score and bad loans though.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8472
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.057   0.086   0.095   0.121   1.042    8472
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.374   0.469   0.482   0.566   3.164    8472

While most of the DTI are at 0.2, we are seeing outliers with big numbers of DTI. They are likely to be real because of the nature of our dataset so I will keep them. Data transformation can be used to normalize the data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6300    8316   12000   35000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.602   3.799   3.794   4.079   4.544
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   31.62   63.25   79.37   85.13  109.50  187.10

The distribution of loan original amount appear “spiky” with specific peaks presumably representing the amount for a specific type of loan. However, It looks to me that the original data appeared positively skewed and log transformation appear to help normalizing the dataset.

Univariate Analysis

What is the structure of your dataset?

Size

The original dataset contains 113937 rows representing loan listings with 81 variables describing each listing. I selected 15 variables for the initial EDA. I found that there were duplicated loan listings and deleted the duplications. The final working dataset contains 113066 observations of 14 variables (see last question for reason to delete one variable).

Variables

The variables can be grouped by their data types as follows:

Date: ListingCreationDate
Factor: ListingNumber, Term, LoanStatus, ListingCategory, BorrowerState, Occupation, EmploymentStatus
Integer: EmploymentStatusDuration, CreditScoreRangeLower, CreditScoreRangeUpper, LoanOriginalAmount
Number: BorrowerAPR, DebtToIncomeRatio, StatedMonthlyIncome

The distribution of some numeric variables appeared to be skewed but can be transformed to approach normal distribution.

Content

The loans in this dataset are short term loans of 12 months, 36 months or 60 months durations, ranging from 2005 to 2014. The amount of original loan range from 1,000 to 35,000. The top reason for taking out the loan is debt consolidation. Most of the borrowers are from California. Professionals and programmers came on top of the types of borrowers’ occupations.

The APR distribution appear bimodal with a broad first peak that appear normally distributed and a second sharp peak at 0.36% (located at the right of the first peak near the end of the distribution).

What is/are the main feature(s) of interest in your dataset?

The main feature of interst is the APR. I am interested in finding factors that affects the APR.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think that high credit score and low debt to income ratio would help bring down the APR. Employment stability and income can also be an important factor.

Did you create any new variables from existing variables in the dataset?

I converted Listing Category from numeric to factor type by mapping the number representing a category to the actual name of the listing category so that it would be more intuitive to interpret the outcome.

I also simplified the loan status variable to group all loans in question into “bad loans”.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I initially included two variables, UpperCreditScore and LowerCreditScore, regarding credit scores. After finding that the pattern are almost identical I have decided to include just one, resulting in 14 variables total in the working dataset.

I changed the data type for ListingNumber, ListingCreationDate and Term because it fits the nature of the data better. I also changed the levels in ListingCategory and LoanStatus to simplify the analysis.

I found duplicated listings that I have deleted to make sure that all listings were unique.

Overall the dataset appear to contain some NA or " " levels in categorical variables and some outliers in numerical levels. At this point I have decided to keep them because they appear to be true data points.

Bivariate Plots Section

Here I will focus on the following questions:

I. Which numeric variable has a correlation with APR?
II. How does APR change over time and other variables?
III. What are the characteristics of the current, completed and bad loans?

I. Which numeric variable has a correlation with APR?

To reduce computing time for initial analysis I created a smaller data set with only numeric variables and randomly selected ten thousand loan listings to generate correlations plots of those numeric variables against each other using ‘’’ggpairs’‘’. I also generated new numeric columns with modified values (’‘’log10’‘’ or’‘’sqrt’’’) to normalized some of the variables.

##   BorrowerAPR      EmploymentStatusDuration CreditScoreRangeUpper
##  Min.   :0.00653   Min.   :  0.00           Min.   : 19.0        
##  1st Qu.:0.15629   1st Qu.: 26.00           1st Qu.:679.0        
##  Median :0.20984   Median : 67.00           Median :699.0        
##  Mean   :0.21898   Mean   : 96.06           Mean   :704.5        
##  3rd Qu.:0.28386   3rd Qu.:137.00           3rd Qu.:739.0        
##  Max.   :0.51229   Max.   :755.00           Max.   :899.0        
##                    NA's   :7600             NA's   :566          
##  DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount    sqrt_ESD     
##  Min.   : 0.000    Min.   :      0     Min.   : 1000      Min.   : 0.000  
##  1st Qu.: 0.140    1st Qu.:   3198     1st Qu.: 4000      1st Qu.: 5.099  
##  Median : 0.220    Median :   4667     Median : 6300      Median : 8.185  
##  Mean   : 0.276    Mean   :   5604     Mean   : 8316      Mean   : 8.629  
##  3rd Qu.: 0.320    3rd Qu.:   6812     3rd Qu.:12000      3rd Qu.:11.705  
##  Max.   :10.010    Max.   :1750003     Max.   :35000      Max.   :27.477  
##  NA's   :8472                                             NA's   :7600    
##    log10_SMI       sqrt_DTIR    
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.505   1st Qu.:0.374  
##  Median :3.669   Median :0.469  
##  Mean   :3.615   Mean   :0.482  
##  3rd Qu.:3.833   3rd Qu.:0.566  
##  Max.   :6.243   Max.   :3.164  
##                  NA's   :8472

BorrowerAPR have some small but significant negative correlation with CreditScoreRangeUpper (-0.442) and LoanOriginalAmount (-0.325) but not with EmploymentStatusDuration, DebtToIncomeRatio or StatedMonthlyIncome. Additionally, LoanOriginalAmount has small but significant positive correlation with CreditScoreRangeUpper(0.344) and StatedMonthlyIncome(0.38). None of the modified variables showed a significant improvement in correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  working_prosper_loan$BorrowerAPR and working_prosper_loan$CreditScoreRangeUpper
## t = -159.41, df = 112470, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4340443 -0.4245100
## sample estimates:
##        cor 
## -0.4292891

Here we can see the negatively trend between high credit score and borrower APR with a pearson’s coefficient of -0.429 for the entire dataset.

## 
##  Pearson's product-moment correlation
## 
## data:  working_prosper_loan$BorrowerAPR and working_prosper_loan$LoanOriginalAmount
## t = -114.48, df = 113040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3275368 -0.3170890
## sample estimates:
##        cor 
## -0.3223227

We see that higher loan amounts tend to correlate with lower APR. This is specifically true when the loan amount is large. The discrete horizontal lines are due to the fact that loan amounts are given out at a specific number. The pearson’s coefficient for the entire dataset is -0.322.

The two major factors that correlate with APR in this dataset are CreditScoreRangeUpper and LoanOriginalAmount, which makes these two variables top candidates for building a model. I would also like to point out that StatedMonthlyIncome might also have a very mild effect on BorrwerAPR but the coefficient was very low at -0.165. I was surprised to find that DebtToIncomeRatio was not correlated with BorrowerAPR. I suspect that this factor is ignored because many of the loans are used for debt consolidation.

II. How does APR change over time and other variables?

Generally a line plot would be the best to visualize trends over time. However, because we are looking at individual loan listings, a scatter plot turned out to show the APR trend more clearly. We see nicely the upper bound and lower bound of APR over time. For example, during the first few months of data collection, the APR range from 0.03% to slightly over 0.5%. On the other hand, from 2011 to 2014, the lower bound is around 0.06% and the upper bound is around 0.36%. The extreme high and low rates also appear to be rates prior to 2009. The number of loan listings at a specific time is also highlighted by the different shades of grey, with darker dots representing higher number and vice versa.

The flactuation of rates at different times poses an inconsistency in which a high APR at one time could be only moderate in another time. The difference in the upper and lower bound of the APR could explain why the correlation between APR and other variables were not stronger.

## 
##  Pearson's product-moment correlation
## 
## data:  ListingCreationDateSubset$BorrowerAPR and ListingCreationDateSubset$CreditScoreRangeUpper
## t = -134.63, df = 28293, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6319356 -0.6177307
## sample estimates:
##        cor 
## -0.6248849

When I limit the dates to a period of time with similar upper and lower bound in APR (May 2006 to Nov. 2008), I see an increase in correlation coefficient to -0.625.

Another trend that I observed was that between Mar. 2011 and Jul. 2012, BorrowerAPR were offered at very fixed rates, indicated by the blank spaces flanked by dark dotted lines. This trend is very different than other times during the data collection.

The magenta dotted line marks the 0.36% APR which represented the second sharp peak observed in the BorrowerAPR histogram in the previous section. There is a period from Dec. 2010 to Dec. 2012 where 0.36% was offered extensively that could account for that peak.

Next I will look into how the APR trend is affected by other categorical variables.

## # A tibble: 3 × 4
##     Term      mean  median     n
##   <fctr>     <dbl>   <dbl> <int>
## 1     12 0.2162232 0.22189  1614
## 2     36 0.2195722 0.20984 87199
## 3     60 0.2170339 0.20987 24228

Here we see that 12-month term has slightly higher median than the other two terms but 36-month term has the higest mean (green dots). Overall the stats are very similar, suggessting that terms don’t affect the APR much.

12-month and 60 month loans were launched only after end of 2011. 12-month loans discontinued after 2013.

## # A tibble: 3 × 4
##   LoanStatus      mean   median     n
##       <fctr>     <dbl>    <dbl> <int>
## 1  Completed 0.2088913 0.195010 38239
## 2    Current 0.2139750 0.205240 55730
## 3   Bad_loan 0.2538349 0.256385 19072

Here we see that listings in the Bad_loan category has higher APR than the other two categories. One possible explanation for this difference is that the bad loans were issued only during the time when the rate was higher. Alternatively, it could also be that loans with higher APR tend to end up being bad loans. So I will look into how these loan types were distributed over time.

Bad loans ranged from the beginning to the end of the data collection time. There were also more loan listings with lower APR in the completed loans than bad loans, suggesting a correlation between high APR and bad loans.

## # A tibble: 21 × 3
##      ListingCategory   median     n
##               <fctr>    <dbl> <int>
## 1       PersonalLoan 0.176260  2395
## 2       NotAvailable 0.185250 16940
## 3               Boat 0.204620    85
## 4    BabyAndAdoption 0.204930   196
## 5  DebtConsolidation 0.206260 57624
## 6         StudentUse 0.214505   756
## 7     LargePurchases 0.218580   863
## 8           Business 0.219450  7157
## 9    HomeImprovement 0.223620  7388
## 10        Motorcycle 0.230825   304
## # ... with 11 more rows

BorrowerAPR differed depending on the ListingCategory. Personal loan and Not available are the two categories that has the lowest APR, much less than the overall APR median (red dashed line); whereas Household expenses and Cosmetic procedures are the two categories with the highest APRs, way above the overall median.

When we split the time analysis data into listing categories, we see the following: 1. Loans in personal loan and student use category were once offered but discontinued in the middle of the data collection period. 2. Loans in debt consolidation, home improvement, business, auto and other category were offered consistently throughout the data collection period 3. There were a series of new loans offered after 2012, among those the more popular ones include Medical or dental, wedding loans, household expenses, etc. The medians of these different types of loans vary drastically. 4. Most of the Not available listings were collected before 2008 and all others were collected only after 2008, suggesting that this variable was not collected until after 2008.

Combining the two plots together we can also find out that the rates before 2008 have lower top and bottom limits, which might explain why the Not available listing has lower median. Similarly, there were more personal loans with low APR given out, resulting in a lower median.

Maine and Iowa have the lowest APRs (with outliers) and Arkansas and Alabama have the highest APRs. At first I thought it was due to the state usury limit, but after checking this article I didn’t find a correlation. For example, the legal rate of interest for ME, AR and AL are all 6%. Overall, the BorrowerAPR varies from state to state.

When we split the time series data into borrower states we can see the following, 1. Loans offered to IA, ME and ND were discontinued after 2009. In contrast, loans offered to SD only started after 2009. 2. Early discontinuation of loans in ND and IA could count for the missing peak at 0.36% in the APR histogram in previous section. 3. There seems to be an increase in issueing many more loans after 2012 in most of the states. 4. The NA listings seem to exist only prior to 2008, similar to Listing Category. 5. Before 2008, BorrowerAPR seemed drastically different from state to state (eg.CA vs CO). However, after 2008 the high and low limits of BorrowerAPR are very similar.

When we look at this plot we can also see individual reasons for why the median BorrowerAPR is different for each state. For example, the low rates of Maine and Iowa are due to the fact that the loans were issued at a time when the rates were low and discontinued when the rates went up.

Judge and Doctor receive the lowest APR. Nurse’s Aide and Teacher’s Aide receive the highest APR. Again we see that BorrowerAPR varies among different occupations adn seems to be related to the socialeconomic status of the borrower.

The data collection for Occupation seems quite completed with most of the levels covering from 2006 to 2014. There are two groups of NAs at the beginning (2006 - 2008) and end (2014) of the data collection time.

Part-time and Full-time status have lowest APRs whereas Other and Not employed have highest APRs.

Here we see again that the definition of levels in this category is confusing. There is almost no data for “Employed” before 2010, which was probably classified as “Full-time” based on the density (shades of gray) of the loan listings. There is also decreased number of loans in “part-time” and “retired” after 2011 and it’s not clear if the company decided not to offer loans to these categories or were they included in “others”, another category that began to accumulate data only after mid-2010.

One thing that I noticed is that listings after 2009, with borrowers that are not employed, consistently had higher APRs on the lower range, set arbitrarily at 0.1% (pink dotted line) when compared to the other levels in this variable. Additionally, there seems to be an increase of issueing loans to self- employed borrowers after 2011.

In this section we look into how loan listing were offered regionally, over time, and to what type of borrowers. We learn that BorrowerAPR varies from level to level in ListingCategory, BorrowerState, Occupation, LoanStatus and EmploymentStatus but remains similar in Term. Furthermore, we learn that data points in Occupation, Term and LoanStatus are collected consistently throughout the data collection period but not in BorrowerState, ListingCategory and EmploymentStatus, resulting in more missing values in the later variables.

III. Characteristics of bad loans

In previous section we see that BorrowerAPR is higher in the Bad_loan level so I want to look further into other variables that show a difference to futher characterize the bad loan category.

## # A tibble: 3 × 3
##   LoanStatus median     mean
##       <fctr>  <dbl>    <dbl>
## 1  Completed    699 704.6653
## 2    Current    719 717.8541
## 3   Bad_loan    679 664.9343

Both mean and median is lower for Bad_loans than for Completed and Current loans.

Here we can see that the low credit scores are from between 2006 to 2007.

A zoom-in of CreditScoreRange shows that Current loans has the highest median of credit scores followed by Completed then Bad_loan.

We know from previous session that the distribution is very positively skewed. It’s hard to see the bulk of the data because of the outliers so I will zoom-in first.

## # A tibble: 3 × 4
##   LoanStatus median      mean     n
##       <fctr>  <dbl>     <dbl> <int>
## 1  Completed   0.19 0.2641048 35490
## 2    Current   0.23 0.2619848 51712
## 3   Bad_loan   0.22 0.3425138 17367

Current and Bad_loan have similar medians but Bad_loan has a much higher mean, suggesting that it might have more data points with higher DTI ratio. To test that, I decide to look at the summary of data points with DTI ratio greater than one.

## # A tibble: 3 × 4
##   LoanStatus median     mean     n
##       <fctr>  <dbl>    <dbl> <int>
## 1  Completed   3.02 5.067957   346
## 2    Current   1.67 3.526225   151
## 3   Bad_loan   5.10 5.734488   301

When I look at median and mean of DTI ratios greater than one, I see that Bad_loan has higher median and mean than the other two categories.

## # A tibble: 3 × 4
##   LoanStatus   median     mean     n
##       <fctr>    <dbl>    <dbl> <int>
## 1  Completed 4416.667 5326.269 38239
## 2    Current 5166.667 6155.481 55730
## 3   Bad_loan 3750.000 4549.716 19072

Bad_loan has the smallest median and mean for StatedMonthlyIncome, suggesting that borrowers in this group tend to have less monthly income.

## # A tibble: 3 × 4
##   LoanStatus median      mean     n
##       <fctr>  <dbl>     <dbl> <int>
## 1  Completed   4500  6201.174 38239
## 2    Current  10000 10346.692 55730
## 3   Bad_loan   4500  6620.950 19072

Current loans have the highest median and mean followed by Completed and Bad_loans. This could be because both Completed and Bad_loans have more data points from earlier times when the loan amount was less. To check it I will only look at data points that are after 2009-08-01.

## # A tibble: 3 × 4
##   LoanStatus median      mean     n
##       <fctr>  <dbl>     <dbl> <int>
## 1  Completed   5000  7247.232 13436
## 2    Current  10000 10345.499 55700
## 3   Bad_loan   5000  7137.587  7131

It doesn’t appear to be the case as advancing the listing creating date only increase a little of the median and the mean of Completed and Bad_loans but they didn’t catch up with Current loans.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  1. BorrowerAPR vs. numerical variables:

APR is negatively correlated with CreditScoreRangeUpper (-0.442) and with LoanOriginalAmount (-0.325).

  1. BorrowerAPR vs. time and categorical variables:

APR is affected by ListingCreationTime,, BorrowerState, Occupation and EmploymentStatus but not Term.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  1. Additional numerical correlations:

LoanOriginalAmount has low but significant correlation with CreditScoreRangeUpper (0.344) and StatedMonthlyIncome(0.38)

  1. Characteristics of Bad_loan listings:

The Bad_loan in LoanStatus has lower mean/median in CreditScoreRangeUpper, StatedMonthlyIncome and LoanOriginalAmount but higher mean/median in BorrowerAPR and DebtToIncomeRatio for listings with DTI ratio greater than one.

  1. Time analysis summary:

By looking at the variables over time I found that the maximum/minimum of APR change over time and could be an influencer of APR correlation with other variables (eg. by restricting to listings offered at a smaller time window with similar APR max/min I can improve the correlation coefficient). In addition, I found changes in data collection and loan offering strategies that would help in refining the analysis strategy down the road.

What was the strongest relationship you found?

Despite the fact that this dataset contained many loans with bad credit scores, I found that BorrowerAPR negatively correlated with CreditScoreRangeUpper the most. This correlation can be further strengthened by looking at specific time windows during which the max/min APRs are similar.

Multivariate Plots Section

I would imagine that for a loan company, what they want to see least is the occurrence of bad loans. Therefore in this section, I would like to focus on looking at what are the characteristics of bad loans from a few different angles.

I. How is LoanStatus distributed on the APR vs. CreditScoreUpper scatterplot?

From previous section we saw that BorrowerAPR and CreditScoreUpper had the best negative correlation so I would like to find out how the distribution of LoanStatus is within this correlation.

At this resolution, we see that both the Completed and Bad_loan classes have a more diverse distribution with no particular pattern. The Current class is more centered on the upper half of the distribution, indicating that current loans have a higher requirement of CreditScoreRange.

II. What is the trend of average BorrowerAPR of Bad_loans over time?

We saw that BorrowerAPR is higher in the Bad_loan category. Did this happen throughout the data collection time or was it because of some specific period that had a very high rate? To find out, I will look at the trend of BorrowerAPRs of different LoanStatus over time.

Despite a similar trend to the Completed listings, the Bad_loan listings had an overall higher BorrowerAPR all the time.

Is this true when we split the data into different categorical variables?

After splitting the data into subgroups, some levels with few data points such as boat or RV did not produce the curves or had curves with high deviation ( overfitting). When we focus on levels with sufficient data points we see that all of them had Bad_loan listing with higher BorrowerAPR just like we saw earlier.

There seemed to be more variation of LoanStatus pattern but overall the Bad_loan listings have higher BorrowerAPRs with the exception of AK.

Even though the difference is not as obvious after splitting, most levels in Occupation have higher BorrowerAPR in Bad_loan listings with the exception of flight Attendant.

Overall the higher BorrowerAPR of Bad_loans persisted over time across most of the categorical variables.

III. What is the percent of bad_loans in different listing categories
across different states?

I am curious to see if there are spicific traits for Bad_loan listings. For example, is there a certain listing category type in a specific state that has a higher chance to become a Bad_loan? or, is there an occupation within a listing category with higher percentage of Bad_loan?

## Source: local data frame [636 x 5]
## Groups: ListingCategory, BorrowerState [636]
## 
##      ListingCategory BorrowerState LoanStatus count      ratio
##               <fctr>        <fctr>     <fctr> <int>      <dbl>
## 1       NotAvailable                 Bad_loan  1623 0.29785282
## 2       NotAvailable            CA   Bad_loan  1030 0.47884705
## 3  DebtConsolidation            CA   Bad_loan   700 0.09205681
## 4       NotAvailable            TX   Bad_loan   534 0.53993933
## 5       NotAvailable            GA   Bad_loan   469 0.50214133
## 6       NotAvailable            IL   Bad_loan   401 0.47176471
## 7  DebtConsolidation            FL   Bad_loan   382 0.11225389
## 8  DebtConsolidation            IL   Bad_loan   332 0.10946258
## 9  DebtConsolidation            NY   Bad_loan   302 0.08079187
## 10 DebtConsolidation            GA   Bad_loan   300 0.13215859
## # ... with 626 more rows

I first grouped the data by ListingCategory then by BorrowerState. I calculated the ratio of each levels in LoanStatus. I then selected a table with only Bad_loan listings and use the tile plot to visualize the ratio of Bad_loans in color with the corresponding number of Bad_loan listings in white.

As shown in this plot, each tile represent a ListingCategory type (x-axis) in a specific state (y-axis). Empty tiles indicate that there is no such subcategory or that there is no Bad_loan listings in that subcategory.

A few trends can be observed from the plot:

  1. All light blue tiles (those with higher percentage of bad loan) has relatively small sample size, suggesting that these are events that most likely happened by chance and not due to the possibility that a specific category in a state is more prone to result in bad loans. However, the State of SD seemed to have a higher number of light blue tiles that I would recommend to pay a closer attention.

  2. The left side of the figure has less numbers of listings than the right side of the figure presumably due to the fact that these categories were introduced later in the dataset.

  3. DebtConsolidation has mostly dark blue tiles, indicating that the company is doing a good job in managing their main sector of business.

  4. PersonalLoan and StudentUse has more lighter blue tiles than other categories which might explain why these two categories were discontinued.

Taken together, this will be a good figure to use for following up on the performance of the loans in a more detailed manner.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I focused on investigating the relationship between BorrowerAPR and loan categories (Current, Completed and Bad_loans). I found that Current loans have a higher correlation between BorrowerAPR and CreditScoreUpper. Additionally, I also found that the Bad_loans consistently associated with a higher BorrowerAPR over time.

Were there any interesting or surprising interactions between features?

After breaking down the number of BadLoans (both the ratio and the absolute number) over BorrowerStates and ListingCategory, I could see trends of how the company was managing their loan listings. For example, on one hand, loan categories that had higher ratio of bad loans were discontinued; on the other hand, the company maintained a low ratio of bad loans for the main ListingCategory of their business.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I did not create a model because the linear relationship between BorrowerAPR and other features is not very strong. Given that there are many features in the dataset, I would probably try some feature reduction methods such as Principle Coponent Analysis to further narrow down features that might affect the BorrowerAPR for model construction (See reflection).


Final Plots and Summary

Plot One

Description One

There is a moderate negative trend between the borrower’s credit score and the APR of the loan. Note that in the bottom of this plot there are some cases of loans whose borrowers’ credit scores are very poor. This could be due to the nature of the dataset because a moajority of the loans are given out for debt consolidation.

Plot Two

Description Two

In this time-series analysis that tracks borrowers’ APR distribution, we see that loans that have bad standing have hiher APR than completed or current loans listings.

Plot Three

Description Three

A tile plot of bad loan ratio shown in different shades of blue in different listing category across different states. The numbers represent the actual number of bad loan listings in each category at each state. We see that for the major category of this dataset, debt consolidation, although the actual numbers of bad loans are higher, the ratio is relatively low in the total amount of loan listings in that category.


Reflection

I chose this data set to familiarize myself with analyzing real world data. The reflections along each step of the EDA are summarized below.

1. Data wrangling:

As a personal taste this is my favorite part of the process. I enjoyed finding discrepancies in the dataset and trouble-shoot to see what caused it and how to fix it.

This is a pretty organized data set in the sense that a csv file with variables were already available. Neverthless, a few treatments were needed before the data set was ready for analysis. There were duplicates in the dataset that needed to be taken care of. The classes of some categorical variables were assigned at different time of data collection and the incoherence inteferes with interpretation of data analysis.

The data set included data points collected over a long period of time during which many of the numerical variables (eg. APRs, Credit Scores) might have fluctuated significantly. This might cause difference in interpretation. For example, what was considered a low APR rate at one time might be might high in another time. For a more consistent analysis further down the road, I would consider using some max./min. reduction to create new variables with normalized numbers.

2. Data analysis:

In this section I tried different plots and plotting strategies (eg., sorting by quantity) to better data visualization and analysis. Because this is my first time to use R, I spent a lot of time googling about how to code to get the result I wanted and I was pretty happy with what I could achieve. One important aspect of data analysis that I left out was statistical analyses to see if the differences that I was seeing was statistically significant. However, given that most EDA that I saw didn’t include statistical analysis and that the report was becoming too long. I have decided to leave it out. Many of the comparisons can be done with student T tests or 3-way ANOVA tests.

A surprising finding that I had while analyzing this particular dataset was that borrowers with no income or low credit score was able to get a loan. After researching more into the topic, some factors that might have contributed to it include collaterals or prior relationship between the client and the company. In the original dataset, there were features named “Investors”, “InvestmentsFromFriends”, “recommendations” and “IsBorrowerHomeowner” that can be investigated further to see if they also affect the BorrowerAPR.

3. Model building:

Given that this is a data set with many variables, there are different questions that one can ask for model building. For example, we can try to build a linear regression model that can predict what would be a good APR rates for an applicant if we can find a numerical variable that strongly correlate with the APR. However, I would proceed with caution because the disbribution of BorrowerAPR has to be normal to build the model, which is not the case for this dataset.

Alternatively, we can build a logistic regression model that can predict if a loan applicant is likely to successfully pay off the loan given the features provided. Many of the features indeed showed a difference in the Bad_loans subgroup. Further statistical analysis (t tests) can be conducted to see which difference is significant to proceed to the next step. Lastly, dimentionality reduction techniques can be used to further narrow down dimentions (features) that are important in defining the dataset for model building.